Statistical modeling to analyze factors affecting the middle-income trap in Indonesia using panel data regression

Review Highlights
• Regression modeling with panel data.
• The observation units are provinces in Indonesia, using inter-regionally disaggregated variables to model the middle-income trap problem.


Introduction
The term "Middle Income Trap" (MIT) was first used in 2004 [1]. It describes a country whose average income per person reaches the middle-income level but then stays there for a long time, with little prospect of moving up [2][3][4]. This income stagnation is assumed to occur because MIT countries are unable to compete with high-income countries, whose economies and scientific capacity tend to be more competitive. The MIT discussion is relatively broad, ranging from economic growth, the effects of social conditions, and the effects of spatial development to specialization and innovation [4,5].
In 2014, Indonesia, a country with 34 provinces, had been stuck in MIT for more than 28 years, with a per capita income in the lower-middle class [6]. In 2019, Indonesia's Gross National Income (GNI) per capita was US$4050, which placed it in the upper-middle income (UM) group of countries. However, on July 1, 2020, Indonesia returned to the lower-middle income (LM) group with a GNI per capita of US$3870. The economic contraction caused by COVID-19 in 2020 lowered the level of welfare and returned Indonesia to lower-middle income status. In 2020, the growth rate of per capita Gross Regional Domestic Product (GRDP) at constant prices was negative in almost all provinces, indicating that income levels in Indonesia fell because of the pandemic.
Given Indonesia's MIT problems, namely income stagnation, failure to transition to a high-income economy, and weak competitiveness [7], MIT is an essential topic for Indonesia's development. An analysis is needed of how Indonesia can escape MIT. Because such an analysis depends on the MIT conditions of each province and their transition from year to year, one suitable method is panel data regression [8].
The panel data regression model combines cross-section data and time series data [9,10]. Cross-section data are collected at a single point in time across many objects [11,12]. An object here can be an individual or a group. Time series data, in contrast, are collected over successive periods for a single object [13,14]. Several panel data regression models can be formed, depending on whether the slope and intercept coefficients differ across observation units and over time. Accordingly, there are three approaches to estimating panel data regression models: the Common Effect Model (CEM), the Fixed Effect Model (FEM), and the Random Effect Model (REM) [10,15]. Several studies have examined and developed panel data regression models, including [4,16-22]. This research focuses on identifying which variables significantly affect MIT in Indonesia, which provinces have the most critical influence on MIT conditions in Indonesia, and how their characteristics change from year to year, so that the potential of these provinces can be maximized and serve as a reference for other provinces, improving conditions nationally and helping Indonesia escape MIT.

A. Panel regression
The panel data regression model is formed from panel data [23], which consist of observations on the same cross-sectional units or individuals over several periods (time series) [9]. The general panel data structure is shown in Table 1.
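As an illustration of this structure, the sketch below stacks a balanced panel into the "long" layout that regression routines usually expect, with one row per unit and period. The function name `stack_panel`, the array shapes, and the toy data are assumptions made for this example and are not taken from the study.

```python
import numpy as np

def stack_panel(y, X):
    """Stack a balanced panel into long format.

    y has shape (n, T): n cross-section units observed over T periods.
    X has shape (n, T, K): K predictors for each unit and period.
    Returns unit index, period index, stacked y (n*T,), stacked X (n*T, K).
    """
    n, T = y.shape
    unit = np.repeat(np.arange(n), T)    # 0,0,...,0,1,1,...
    period = np.tile(np.arange(T), n)    # 0,1,...,T-1,0,1,...
    return unit, period, y.reshape(n * T), X.reshape(n * T, -1)

# Hypothetical toy panel: 3 units, 4 periods, 2 predictors
rng = np.random.default_rng(0)
y = rng.normal(size=(3, 4))
X = rng.normal(size=(3, 4, 2))
unit, period, y_long, X_long = stack_panel(y, X)
```

Row `i * T + t` of the stacked arrays holds the observation for unit `i` in period `t`, which is the ordering assumed by the other sketches in this section.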
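The regression model using cross-section data can be written as Eq. (1).
$$y_i = \beta_0 + \sum_{k=1}^{K} \beta_k x_{ki} + \varepsilon_i \quad (1)$$
with $i = 1, 2, \dots, n$, where $n$ is the number of cross-section units. The regression model using time series data is written as Eq. (2).
$$y_t = \beta_0 + \sum_{k=1}^{K} \beta_k x_{kt} + \varepsilon_t \quad (2)$$
with $t = 1, 2, \dots, T$, where $T$ is the number of time periods. In general, the panel data regression model can therefore be written as Eq. (3).
$$y_{it} = \beta_0 + \sum_{k=1}^{K} \beta_k x_{kit} + \varepsilon_{it} \quad (3)$$
where:
$y_{it}$: response variable of the $i$-th individual in period $t$
$\beta_0$: intercept
$\beta_k$: slope coefficient of the $k$-th predictor variable
$x_{kit}$: the $k$-th predictor variable of the $i$-th individual in period $t$
$\varepsilon_{it}$: residual of the $i$-th individual in period $t$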

B. Regression model estimation
The Common Effect Model (CEM), the Fixed Effect Model (FEM), and the Random Effect Model (REM) are the three approaches typically used to estimate panel data regression models [10]. The parameters of CEM and FEM are estimated using Ordinary Least Squares (OLS) [24], with the FEM intercepts expressed through dummy variables. The parameters of REM are estimated using Generalized Least Squares (GLS).
(1) Common Effect Model (CEM)
CEM uses the same approach as estimating a multiple linear regression model. The principle of CEM is to regress all of the pooled data without regard to time and individual effects [25]. In CEM, the intercept is assumed to be constant, that is, the same for every individual and every period. The general equation of the CEM regression model is written in Eq. (4).
$$y_{it} = \beta_0 + \sum_{k=1}^{K} \beta_k x_{kit} + \varepsilon_{it} \quad (4)$$
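A minimal sketch of the pooling idea behind CEM follows: all unit-period rows are regressed together with one common intercept. The helper `pooled_ols` and the synthetic data are assumptions for illustration only; they do not reproduce the study's estimates.

```python
import numpy as np

def pooled_ols(y, X):
    """Pooled OLS with a single common intercept.

    y: stacked responses, shape (N,); X: stacked predictors, shape (N, K).
    Returns the coefficient vector (intercept first) and R-squared.
    """
    Z = np.column_stack([np.ones(len(y)), X])        # add the common intercept
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)     # least-squares solution
    resid = y - Z @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return coef, r2

# Synthetic stacked data (hypothetical): true intercept 1.5, slopes 2.0 and -0.5
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
y = 1.5 + X @ np.array([2.0, -0.5]) + rng.normal(scale=0.1, size=60)
coef, r2 = pooled_ols(y, X)
```

Because unit and period labels never enter the design matrix, any systematic difference between individuals ends up in the residuals, which is the restriction the Chow test later checks.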
(2) Fixed Effect Model (FEM)
FEM is a panel data regression estimation method that accommodates differences in characteristics between individuals through the intercept. The model can be estimated by dummy-variable regression, in which each individual and each period is represented by a dummy variable. Several types of FEM are described as follows:
a. FEM with a constant slope but an intercept that varies across individuals. This model ignores time effects but allows different effects between individuals [25]. The model is written in Eq. (6).
$$y_{it} = \alpha_0 + \sum_{i} \alpha_i D_i + \sum_{k=1}^{K} \beta_k x_{kit} + \varepsilon_{it} \quad (6)$$
with:
$y_{it}$: response variable of the $i$-th individual in period $t$
$\alpha_0$: intercept of the individual FEM model
$\alpha_i$: slope of the $i$-th individual
$\beta_k$: slope coefficient of the $k$-th predictor variable
$D_i$: dummy category for the $i$-th individual
$x_{kit}$: the $k$-th predictor variable of the $i$-th individual in period $t$
$\varepsilon_{it}$: residual of the $i$-th individual in period $t$
b. FEM with a constant slope but an intercept that varies over time. This model ignores individual effects but allows different effects at different times. The model is written in Eq. (7).
$$y_{it} = \gamma_0 + \sum_{t} \gamma_t D_t + \sum_{k=1}^{K} \beta_k x_{kit} + \varepsilon_{it} \quad (7)$$
with:
$\gamma_0$: intercept of the time FEM model
$\gamma_t$: slope of the $t$-th period
$D_t$: dummy category for the $t$-th period
c. FEM that considers both the individual effect and the time effect. The model is written in Eq. (8).
$$y_{it} = \alpha_0^{*} + \sum_{i} \alpha_i D_i + \sum_{t} \gamma_t D_t + \sum_{k=1}^{K} \beta_k x_{kit} + \varepsilon_{it} \quad (8)$$
with:
$\alpha_0^{*}$: intercept of the individual-time FEM model
$\alpha_i$: slope of the $i$-th individual
$\gamma_t$: slope of the $t$-th period
(3) Random Effect Model (REM)
REM is the last approach to estimating panel data regression models. It arises because the FEM may not represent the actual model, so REM is used instead, with the general equation written in Eq. (9).
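$$y_{it} = \alpha_0 + \boldsymbol{\beta}' \mathbf{x}_{it} + w_{it} \quad (9)$$
with $w_{it} = \varepsilon_{it} + u_i$, where $\varepsilon_{it}$ is the combined cross-section and time series error component and $u_i$ is the error component of the cross-section data.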
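The dummy-variable estimation of the individual-effect FEM can be sketched as follows. As a simplifying assumption, this version uses one dummy per unit and no common intercept, which avoids perfect collinearity between the dummies and a constant; the source's own parameterization may differ. The name `lsdv_individual` and the synthetic data are hypothetical.

```python
import numpy as np

def lsdv_individual(y, X, unit):
    """Least-squares dummy-variable estimate of FEM with individual effects.

    y: (N,), X: (N, K), unit: (N,) integer label of the unit for each row.
    Returns (unit intercepts, common slopes).
    """
    units = np.unique(unit)
    D = (unit[:, None] == units[None, :]).astype(float)  # (N, n) dummy matrix
    Z = np.column_stack([D, X])                          # dummies, then predictors
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    n = len(units)
    return coef[:n], coef[n:]

# Synthetic panel (hypothetical): 4 units, 10 periods, unit-specific intercepts
rng = np.random.default_rng(2)
n, T = 4, 10
unit = np.repeat(np.arange(n), T)
alpha = np.array([1.0, 3.0, -2.0, 0.5])
X = rng.normal(size=(n * T, 1)) + alpha[unit][:, None]   # predictor correlated with unit effect
y = alpha[unit] + 2.0 * X[:, 0] + rng.normal(scale=0.05, size=n * T)
intercepts, slopes = lsdv_individual(y, X, unit)
```

In this toy setup the predictor is deliberately correlated with the unit effect, the situation in which a pooled regression would be biased while the dummy-variable estimate still recovers the common slope.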

C. Best model selection
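Several tests were performed to determine the appropriate estimation model for the panel data regression:
(1) Chow Test
The Chow test chooses between the CEM and FEM models with the following hypotheses:
$H_0$: $\alpha_1 = \alpha_2 = \dots = \alpha_n = 0$ (the CEM model is appropriate)
$H_1$: at least one $\alpha_i \neq 0$ (the FEM model is appropriate)
where $\alpha_i$ is the individual effect of the $i$-th unit.
Test statistic:
$$F = \frac{\left(R^2_{FEM} - R^2_{CEM}\right)/(n-1)}{\left(1 - R^2_{FEM}\right)/(nT - n - K)}$$
with:
$R^2_{FEM}$: coefficient of determination for the FEM model
(2) Hausman Test
The Hausman test is carried out after the Chow test concludes that the FEM model is the best. It selects the better model between FEM and REM with the following hypotheses:
$H_0$: $\mathrm{corr}(x_{it}, \varepsilon_{it}) = 0$ (the REM model is appropriate)
$H_1$: $\mathrm{corr}(x_{it}, \varepsilon_{it}) \neq 0$ (the FEM model is appropriate)
Test statistic:
$$W = (b - \hat{\beta})' \left[\mathrm{Var}(b) - \mathrm{Var}(\hat{\beta})\right]^{-1} (b - \hat{\beta})$$
with:
$b$: coefficient (beta) vector of the FEM model
$\hat{\beta}$: coefficient (beta) vector of the REM model
$H_0$ is rejected if $W > \chi^2_{(\alpha;\, K-1)}$.
(3) Lagrange Multiplier (LM)
This test is used to choose between the CEM and REM estimation models under the following hypotheses:
$H_0$: $\sigma_u^2 = 0$ (the CEM model is appropriate)
$H_1$: $\sigma_u^2 \neq 0$ (the REM model is appropriate)
Test statistic:
$$LM = \frac{nT}{2(T-1)} \left[ \frac{\sum_{i=1}^{n} \left( \sum_{t=1}^{T} e_{it} \right)^2}{\sum_{i=1}^{n} \sum_{t=1}^{T} e_{it}^2} - 1 \right]^2$$
where $e_{it}$ are the residuals of the CEM model. $H_0$ is rejected if $LM > \chi^2_{(\alpha;\, 1)}$.

Table 2 (partial):
Variable   Description
…          Life expectancy
$X_3$      Foreign Investment
$X_4$      Gross Enrollment Rate
$X_5$      Open Unemployment Rate
$X_6$      Gross Fixed Capital Increase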
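The Chow statistic compares the fit of FEM and CEM through their coefficients of determination, with degrees of freedom $n-1$ and $n(T-1)-K$ as given in the rejection rule for this test. The sketch below computes it from two R-squared values. The function name `chow_f` and the R-squared inputs are hypothetical; the values of n, T, and K assume the study design described in the text (33 provinces, annual data for 2010 to 2020, six predictors).

```python
def chow_f(r2_fem, r2_cem, n, T, K):
    """Chow F statistic for choosing between CEM and FEM.

    r2_fem, r2_cem: coefficients of determination of the two models
    n: number of individual units, T: number of periods, K: number of predictors
    Returns (F, numerator df, denominator df).
    """
    df1 = n - 1
    df2 = n * T - n - K          # equals n(T - 1) - K
    f = ((r2_fem - r2_cem) / df1) / ((1.0 - r2_fem) / df2)
    return f, df1, df2

# Hypothetical R-squared values; n, T, K are assumptions based on the data description
f_stat, df1, df2 = chow_f(0.95, 0.60, n=33, T=11, K=6)
```

The resulting statistic would be compared with the critical value of the F distribution with `df1` and `df2` degrees of freedom at the chosen significance level.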

D. Parameter significance test
Parameter significance testing was conducted to determine whether the predictor variable significantly affected the response variable [ 26 , 27 ].
(1) Simultaneous Test
The simultaneous parameter significance test examines the joint influence of all predictor variables on the response variable, with the following hypotheses:
$H_0$: $\beta_1 = \beta_2 = \dots = \beta_K = 0$
$H_1$: at least one $\beta_k \neq 0$, $k = 1, 2, \dots, K$
$H_0$ is rejected if $F_h > F_{(\alpha/2;\, K;\, n-K-1)}$, where $F_h$ is the computed test statistic in Eq. (13).
(2) Partial Test
The partial parameter significance test examines how each predictor variable individually influences the response variable, with the following hypotheses:
$H_0$: $\beta_k = 0$
$H_1$: $\beta_k \neq 0$, $k = 1, 2, \dots, K$
$H_0$ is rejected if $|t_h| > t_{(\alpha/2;\, n-K-1)}$, where $t_h$ is the computed test statistic in Eq. (14).
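Both tests can be illustrated on an ordinary least-squares fit. The sketch below uses the textbook definitions, namely the overall F statistic as the ratio of the regression mean square to the residual mean square, and each t statistic as a coefficient divided by its standard error. These standard forms are an assumption here, since Eq. (13) and Eq. (14) are not reproduced; the function `ols_tests` and the data are hypothetical.

```python
import numpy as np

def ols_tests(y, X):
    """Overall F statistic and per-coefficient t statistics for OLS.

    y: (N,), X: (N, K) without the intercept column.
    Returns (F, t statistics with intercept first, residual degrees of freedom).
    """
    N, K = X.shape
    Z = np.column_stack([np.ones(N), X])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    fitted = Z @ coef
    sse = np.sum((y - fitted) ** 2)            # residual sum of squares
    ssr = np.sum((fitted - y.mean()) ** 2)     # regression sum of squares
    df_resid = N - K - 1
    sigma2 = sse / df_resid                    # residual mean square
    cov = sigma2 * np.linalg.inv(Z.T @ Z)      # covariance of the estimates
    t_stats = coef / np.sqrt(np.diag(cov))
    f_stat = (ssr / K) / sigma2
    return f_stat, t_stats, df_resid

# Synthetic data (hypothetical): the first predictor matters, the second does not
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
y = 1.0 + 3.0 * X[:, 0] + rng.normal(size=100)
f_stat, t_stats, df_resid = ols_tests(y, X)
```

The computed statistics are then compared with the F and t critical values for the stated degrees of freedom.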

E. Data sources
The data used in this study are secondary data obtained from Badan Pusat Statistik (BPS, Statistics Indonesia) and the World Bank. The data cover 33 provinces in Indonesia from 2010 to 2020. The variables comprise the response variable ($Y$) and the predictor variables ($X$), detailed in Table 2.

F. Data analysis technique
Data analysis in this study consisted of the following steps:
(1) Perform data preprocessing: convert GRDP per capita to US$ using the Atlas conversion method.
(2) Describe the characteristics and influencing factors of MIT in each province of Indonesia.
(3) Check for multicollinearity among the predictor variables.
(4) Estimate the parameters of the Common Effect Model (CEM), the Fixed Effect Model (FEM), and the Random Effect Model (REM) panel data regression models.
(5) Select the best model using the Chow, Lagrange Multiplier, and Hausman tests.
(6) Perform a significance test of the regression parameters of the selected best model.

A. Data exploration
The data exploration stage is the first step in identifying the characteristics of the research data, and it makes the data easier to read and understand. The characteristics of MIT in Indonesia are shown in Fig. 1. In 2020, Indonesia was in its 34th year in MIT, which began in 1986, and was classified as a Lower Middle-Income (LM) country with a per capita GRDP of US$3983. As previously mentioned, Indonesia entered the UM class in 2019 but returned to LM status in 2020. The pandemic, which is suspected of having lowered Indonesia's income class, also affected the provinces: most provinces experienced a decrease in GRDP per capita between 2019 and 2020, and only ten provinces experienced an increase. The gap in per capita income between provinces in 2020 was relatively large, ranging from the lowest, East Nusa Tenggara (US$1344), to the highest, DKI Jakarta (US$18,217). DKI Jakarta thus had a per capita income almost 14 times that of East Nusa Tenggara. Conditions in 2020 worsened compared with 2019: three provinces, namely Jambi, East Java, and Bali, dropped from UM in 2019 to LM in 2020. This helps explain why Indonesia dropped from Upper Middle (UM) to Lower Middle (LM) income status in 2020, a change thought to be closely related to the COVID-19 pandemic.
Based on Fig. 2, the response variable $Y$ is positively related to each of the predictor variables $X_1$, $X_2$, $X_3$, $X_4$, $X_5$, and $X_6$. This means that the larger the values of the predictor variables, the higher the GRDP per capita and the greater the likelihood that a region escapes MIT.

B. Multicollinearity detection
Checking for multicollinearity is essential before modeling. In this study, the Variance Inflation Factor (VIF) was used to determine whether there was multicollinearity among the predictor variables.
Table 3 shows that all predictor variables have a VIF value of less than 10, meaning there is no multicollinearity problem.
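For reference, the VIF of predictor $j$ is $1/(1 - R_j^2)$, where $R_j^2$ comes from regressing that predictor on all the others. A minimal sketch under that standard definition follows. The function name `vif` and the synthetic data are assumptions for this example, and the values it produces are unrelated to Table 3.

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of X (shape (N, K))."""
    N, K = X.shape
    out = np.empty(K)
    for j in range(K):
        target = X[:, j]
        # Regress predictor j on an intercept and all remaining predictors
        others = np.column_stack([np.ones(N), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic predictors (hypothetical): the third is nearly a copy of the first
rng = np.random.default_rng(4)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.1 * rng.normal(size=200)
v = vif(np.column_stack([x1, x2, x3]))
```

In this toy data the first and third columns exceed the threshold of 10 used in the text, while the independent second column stays close to 1.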

C. Panel data regression analysis
Three MIT models are estimated: the Common Effect Model (CEM), the Fixed Effect Model (FEM), and the Random Effect Model (REM). The next step is to choose the best of the estimated models. The first step is the Chow test, which chooses between the CEM and FEM models. The Chow test yields a test statistic of 207.25 with a p-value of 0.000, so $H_0$ is rejected, which means that the FEM model is preferred over the CEM model.
The Hausman test is then used to choose between the FEM and REM models. It yields a test statistic of 22.94 with a p-value of 0.000, so $H_0$ is rejected, which means that the FEM model is preferred over the REM model. The Chow and Hausman test results therefore indicate that the appropriate estimation method is FEM. Next, the simultaneous and partial parameter significance tests are carried out. Two FEM variants are tested because they have an $R^2$ value of more than 70%, namely the individual FEM and the individual-time FEM.
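The Hausman statistic is a quadratic form in the difference between the FEM and REM coefficient estimates, weighted by the inverse of the difference of their covariance matrices. The sketch below shows that computation under the standard definition. The function name `hausman` and all inputs are hypothetical and do not correspond to the estimates reported in this study.

```python
import numpy as np

def hausman(b_fem, b_rem, cov_fem, cov_rem):
    """Hausman statistic W = d' [Var(b_fem) - Var(b_rem)]^{-1} d, d = b_fem - b_rem."""
    diff = np.asarray(b_fem, dtype=float) - np.asarray(b_rem, dtype=float)
    v = np.asarray(cov_fem, dtype=float) - np.asarray(cov_rem, dtype=float)
    # Solve v x = diff instead of forming the inverse explicitly
    return float(diff @ np.linalg.solve(v, diff))

# Hypothetical coefficient estimates and covariance matrices for two predictors
b_fem = [1.0, 2.0]
b_rem = [0.9, 2.2]
cov_fem = np.diag([0.02, 0.05])
cov_rem = np.diag([0.01, 0.01])
w = hausman(b_fem, b_rem, cov_fem, cov_rem)
```

A large value of the statistic relative to the chi-square critical value points toward FEM, as in the decision reported above.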
The simultaneous test on the individual FEM model rejects $H_0$ with a p-value of 0.000, which means that at least one predictor variable significantly affects MIT. Next, a partial test is carried out to determine the effect of each variable, as shown in Table 4.
Based on the estimation results of the individual-effect FEM model, the goodness of fit is 97.32%, which means that the variables $X_1$, $X_2$, $X_3$, $X_4$, $X_5$, and $X_6$ explain 97.32% of the variation in the response variable, MIT.
The simultaneous test on the individual-time FEM model rejects $H_0$ with a p-value of 0.000, meaning that at least one predictor variable significantly affects MIT. Next, a partial test is carried out to determine the effect of each variable, as shown in Table 5.
Based on the estimation results of the individual-time effect FEM model, two predictor variables are not significant, with p-values greater than $\alpha = 0.05$. The variables that are not significant are removed from the model, and the individual-time FEM model is re-estimated using only the significant variables.
Based on the estimation results of the individual-time effect FEM model without the variables $X_1$ and $X_5$, the goodness of fit is 97.67%, which means that 97.67% of the variation in the MIT variable can be explained by the variables $X_2$, $X_3$, $X_4$, and $X_6$. The model obtained is then tested to determine whether it meets the assumptions that the residuals are identical, independent, and normally distributed. These tests are carried out on the best model obtained, as follows.

$R^2_{CEM}$: coefficient of determination for the CEM model
$n$: number of individual units
$T$: number of periods
$K$: number of predictor variables
$H_0$ is rejected if $F > F_{(\alpha;\, n-1;\, n(T-1)-K)}$.

Table 2
Description of study variables.

Table 3
VIF Values for each predictor variable.

Table 4
Individual FEM partial test.

Table 5
Individual-Time FEM partial test.